The Client
A startup from Silicon Valley, USA, with a product idea: world-class software for animal shelters and rescues across the USA, helping them work faster and more efficiently so they can save more animals with their existing infrastructure.
The Business Scenario
The client had already built some proof-of-concept (POC) features with the help of other developers, and wanted a professional software engineering team to develop the product further into a top-quality offering, guided by feedback from users and industry experts.
The client had no software engineers on their team, so they needed not only an affordable offshore technical team but also substantial help in leading and nurturing it. One major concern was keeping the offshore team fully engaged and productive. Another challenge was ensuring that the team always had a clear understanding of product requirements designed by business analysts working from various US locations.
The Problem Statement
Initially the client had tight budget constraints and could afford only two junior-level engineers. They also had an existing Drupal codebase with a few years' worth of effort invested in it by a previous technical team. This raised a further challenge: the junior engineers would have to fully understand, safely maintain, and gradually improve a codebase that was in a very complex and messy state. All of this had to happen without disrupting ongoing demos and other sales activities.
The Phased Approach
The project was kickstarted with a minimal team: one junior engineer, plus one senior engineer to lead technical activities and act as the single point of contact (SPOC), with some pricing adjustments. This kept development costs low in the initial stages while ensuring smooth communication and steady improvement of the product's internal structure. It was also agreed that more engineers could be added, or more senior people brought in, once the client's workload and cash flow grew. In this initial phase we retained the web hosting setup on the shared VPS the client was already using.
The project evolved gradually, handling technical challenges in the best possible way at minimal cost. The product still lacked many features and had many business problems left to solve. However, the product owner in the US was able to devise simple solutions to the business problems, and the technical team at Livares grasped the ideas and implemented them with minimal delay. The speed with which the composite product team (in-house plus offshore) responded to user feedback created many happy users, and even turned some of them into ambassadors for the product.
Some big customers came on board and started using the product. More users meant more feature requests, feedback, and issues reported per day. Engineering workload grew, and server load grew with it as user activity increased. We knew it was time to move the project to the next phase.
Phase II
The client wanted to add more engineers to the team to handle the increasing workload. We had anticipated such a surge and had pre-trained engineers in our talent pool ready to start working for this client, so Livares was able to add new engineers to the team immediately. Thereafter, Livares also handled the talent hunt, onboarding, and training needed to build a passionate, highly motivated offshore engineering team for this client.
To prepare for the increasing user activity, we decided to migrate the application to another hosting platform. After evaluating many hosting services, we found that a PaaS platform, Pantheon, was the best fit for our infrastructure/DevOps requirements and the load anticipated over the next few years of this phase. The platform also gave us the foundation to establish a cleaner, more standardized DevOps workflow for the project. The client also received angel funding for the product from a well-known animal rescue NGO, which helped fund this infrastructure upgrade.
Engineers at Livares completed the migration to the new PaaS platform with minimal downtime. The migration also involved converting the project's version control system from Mercurial (Hg) to Git.
The product's user base continued to grow, and as its feature set and value propositions evolved and the customer base expanded further, we started seeing a stronger flow of change requests from the sales and marketing team in the US and sensed the need for higher team velocity. We had to release more frequently without compromising the quality or stability of the app. To tame this challenge we revised our development process and introduced more automation into the workflow. The team adopted a standardized feature-branch model for managing the Git repository, revising our workflow into a process similar to the well-known Gitflow (sketched below). From this stage onward we also adopted a Continuous Integration / Continuous Deployment (CI/CD) workflow, with Jenkins automating the pipeline.
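In condensed form, the branch flow looked roughly like this; the branch and release names here are illustrative, not the project's actual ones:

```sh
# Feature work branches off develop and merges back after review and CI.
git checkout -b feature/bulk-intake develop
# ...commit work; CI runs on every push...
git checkout develop
git merge --no-ff feature/bulk-intake

# A release branch is cut from develop and stabilized; in our setup,
# release branches deployed to the test environment.
git checkout -b release/2.4.0 develop
git checkout master
git merge --no-ff release/2.4.0       # ship to production
git tag -a v2.4.0 -m "Release 2.4.0"
git checkout develop
git merge --no-ff release/2.4.0       # carry release fixes back to develop
```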
As the user base kept growing fast, we often didn't have the luxury of scaling the infrastructure up to absorb the increasing server load. So there was an ever-present challenge: continuously fine-tune and optimize both the existing infrastructure and the code so the application ran faster on the hardware we had. Mastering New Relic, an Application Performance Monitoring (APM) tool, proved very helpful for our software engineers in identifying the code paths and DB tables/queries whose optimization would maximize the return on these efforts.
In this phase we also integrated Pingdom for downtime alerting and New Relic for reports on performance degradation, so we often learned about shortcomings in the product before any user reported them. This made our response to user feedback appear faster, because many issues were already in the development pipeline by the time the feedback came in. We also hired a highly experienced DB optimization expert to help optimize some very complex queries and to improve the overall database architecture, performance, and reliability.
Phase III
In time we reached a stage where the client suddenly started signing a lot of new customers, and we began planning to serve a whole new category of organizations that process tens of thousands of animals per month. This meant a big shift in expected server load and called for an upgrade and restructuring of the infrastructure. To meet this challenge we needed more control over the infrastructure, so we decided to move off the PaaS and build our own: a load-balanced, auto-scaling cluster of compute nodes. After some evaluation we chose Amazon Web Services as the foundation. By this time the client had also raised VC funding, which gave us the financial support to take on the costs of the migration.
First, all DNS records were moved to Amazon Route 53. Application servers were hosted on Amazon EC2 instances spun up from Amazon Machine Images (AMIs), placed in an auto-scaling group behind an Amazon ELB load balancer. For each environment, an Amazon RDS instance with a read replica was created to house the application database; the read replica absorbed part of the read traffic. This database would handle hundreds of complex queries per minute to generate the highly customized reports that customer organizations build with the custom report builder feature built into the application. Amazon ElastiCache (Redis) served as the backend caching layer. Following DNS best practice, all these infrastructure components (Elastic Load Balancers, EC2 instances, RDS instances, ElastiCache servers) were aliased with domain names. Finally, the whole setup was modeled as an AWS CloudFormation template, making it easier to maintain, modify, or replicate in other regions.
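To give a feel for this, here is a heavily trimmed, illustrative skeleton of such a CloudFormation template; the resource names, instance types, and AMI ID are placeholders, not values from the real stack:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Illustrative skeleton of the application stack (placeholder values).
Resources:
  AppLoadBalancer:
    Type: AWS::ElasticLoadBalancing::LoadBalancer
    Properties:
      AvailabilityZones: !GetAZs ''
      Listeners:
        - LoadBalancerPort: '80'
          InstancePort: '80'
          Protocol: HTTP
  AppLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-0123456789abcdef0   # placeholder application AMI
      InstanceType: m4.large           # placeholder size
  AppAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AvailabilityZones: !GetAZs ''
      LaunchConfigurationName: !Ref AppLaunchConfig
      LoadBalancerNames: [!Ref AppLoadBalancer]
      MinSize: '2'
      MaxSize: '8'
  AppDatabase:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: mariadb
      DBInstanceClass: db.m4.large     # placeholder size
      AllocatedStorage: '100'
      MasterUsername: appadmin         # placeholder; pass real credentials as parameters
      MasterUserPassword: changeme
  AppDatabaseReadReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      SourceDBInstanceIdentifier: !Ref AppDatabase
      DBInstanceClass: db.m4.large
```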
Amazon S3 was used as the file system for the application. It stored all static assets, user-uploaded photos, and so on, and the CloudFront CDN was integrated to serve them efficiently with thumbnailing and caching, taking that load off the application servers. Amazon S3 also held our deployment bundles, backups, and similar artifacts.
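For illustration, storing a user-uploaded photo in S3 with the AWS SDK for PHP looks roughly like this; the bucket name, key, and file path are hypothetical:

```php
<?php
// Minimal sketch: upload a user photo to S3 so CloudFront can serve it.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',                      // assumed region
]);

$result = $s3->putObject([
    'Bucket'      => 'example-app-assets',         // hypothetical bucket
    'Key'         => 'photos/animal-1234.jpg',     // hypothetical key
    'SourceFile'  => '/tmp/upload/animal-1234.jpg',
    'ContentType' => 'image/jpeg',
]);

// The object is then served through the CloudFront distribution in front of
// the bucket, keeping asset traffic off the application servers.
echo $result['ObjectURL'], PHP_EOL;
```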
The CI/CD pipeline was also fully automated. Branches on the Git origin were hooked to continuous-integration jobs on the Jenkins CI server, integrated with AWS CodeBuild via a plugin. CodeBuild builds a deployable bundle as a gzipped tarball and stores it in Amazon S3; when the build job completes, AWS CodeDeploy is automatically triggered to fetch the bundle from S3 and deploy it to the environment associated with the Git branch. Changes on the development branch deploy to the development environment, and changes on any release branch deploy to the test environment. AWS Lambda was used in several places for small computations, running scripts, integrating data from other APIs, and so on.
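A CodeBuild buildspec for this kind of pipeline might look roughly like the following sketch; the build commands and bundle name are assumptions, not the project's actual build script:

```yaml
# Illustrative buildspec.yml: package the app as a gzipped tarball that
# CodeDeploy can fetch from S3 (an appspec.yml inside the bundle tells
# CodeDeploy how to install it on the EC2 instances).
version: 0.2
phases:
  install:
    commands:
      - composer install --no-dev --optimize-autoloader   # assumed PHP build step
  build:
    commands:
      - tar -czf app-bundle.tar.gz --exclude='.git' .     # gzipped-tarball bundle
artifacts:
  files:
    - app-bundle.tar.gz
```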
With all this infrastructure in place, we were ready to migrate from the old PaaS to the new AWS home. Moving from a PaaS to custom infrastructure also meant the team would need more DevOps capacity. We trained some engineers on the existing team to handle urgent short-term DevOps tasks such as responding to incidents and checking logs, CloudWatch, and other monitoring tools. A highly experienced DevOps expert was also hired to help with the migration and to ensure all the AWS puzzle pieces fit together correctly.
The migration to the new infrastructure was done in several steps. First we executed trial migrations of our development and test environments as a POC; this exposed gaps in our migration process and gave us insight into potential challenges for production. The production migration then went completely smoothly. Afterwards we spent a few days fine-tuning the new infrastructure for the observed user-activity patterns, getting the best performance while minimizing costs by adjusting EC2 instance types and similar parameters.
Alongside the migration to in-house infrastructure, the client also began planning a transition to a fully in-house engineering team, as required by their VC. Hiring was done mostly by the client directly, while our engineering team at Livares provided all the documents and resources for training and knowledge transfer.
Technologies Used
- Language & Runtime: PHP (started at v5.5, gradually upgraded to v7.1) running on PHP-FPM
- Database: MySQL (initially MySQL 5.5, gradually upgraded to the MariaDB equivalent of MySQL 5.7)
- Shell Scripting & Server Configuration Management: Bash, Ansible
- Frameworks: Drupal 7, Laravel, Zend Expressive, PHPUnit
- Dependency Management: Composer
- HTTP Server: Nginx
- Caching Layers: Varnish (edge caching), Redis, Gearman Job Server
- Amazon Web Services: CloudFormation, Route 53, AMI, ELB & EC2 with auto-scaling, RDS, AWS Lambda, AWS VPC, Amazon ElastiCache, CloudWatch Logs, Amazon S3, IAM
- Logging and Monitoring: New Relic, Loggly, PagerDuty
- Code Quality Monitoring: Codacy
- Version Control: started with Mercurial (Hg), migrated to Git later
- Project Management: Jira
- Documentation: Google Docs
Technical Overview
Now let's quickly go through some of the important technologies used, the design decisions behind their selection, and some related challenges we handled.
Drupal 7
The core system was initially developed in Drupal 7 by the previous developers for the ease of reaching a minimum viable product, which makes sense. Later, we decided that the core business logic of the codebase must be kept as independent of Drupal 7 as possible. So we invested a small share of developer time in refactoring the codebase to invert the dependency between our business logic and Drupal's features. The aim was to reach a stage where Drupal works only as a connecting driver between HTTP and our business logic; that is, Drupal's functions/hooks must depend on our service classes, and not the other way around. This design decision makes it easier and safer for the engineering team to later pluck the business logic out of Drupal and run it under another framework or API. The sketch below illustrates the idea.
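A minimal sketch of the inversion, with hypothetical module and class names (mymodule, AnimalSearchService): the service class holds the business logic and knows nothing about Drupal, while the Drupal 7 hooks are thin adapters that depend on the service.

```php
<?php
// Framework-agnostic service class; contains the business logic and has no
// dependency on Drupal APIs.
class AnimalSearchService {
  public function findAdoptable($shelterId) {
    // ...query the repository, apply adoption business rules...
    return array(array('name' => 'Rex'), array('name' => 'Misty'));
  }
}

/**
 * Implements hook_menu().
 */
function mymodule_menu() {
  $items['adoptable-animals'] = array(
    'title' => 'Adoptable animals',
    'page callback' => 'mymodule_adoptable_page',
    'access arguments' => array('access content'),
  );
  return $items;
}

/**
 * Page callback: a thin adapter that delegates to the service class.
 */
function mymodule_adoptable_page() {
  $service = new AnimalSearchService();
  $animals = $service->findAdoptable(123); // shelter id would come from context
  return array(
    '#theme' => 'item_list',
    '#items' => array_map(function ($a) { return check_plain($a['name']); }, $animals),
  );
}
```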
Laravel
When a big set of new requirements came up to address a whole new class of organizations, we decided to develop them as a separate app integrated with the core product in a microservice architecture. The decision rested on the fact that the new features would be used only by a specific type of organization: much bigger than the average customer in the existing base, following very different business processes, and generating server load on a different scale. Developing and maintaining these features as a separate codebase therefore made a lot of sense. Laravel was chosen as the framework because this app was going to be highly complex and would need many of the features Laravel provides, such as templating, form builders, a data access layer, authentication, authorization, and many other security features.
Zend Expressive
Expressive is a lightweight framework from Zend for developing microservices. The product included a few private web-service endpoints serving data such as adoptable animals and lists of previous adopters, allowing customers to integrate with the system and access their data seamlessly from their own apps. We later found that many customers were making heavy use of these web services (hundreds of thousands of requests per week) and demanding more API endpoints covering more data entities in the system. We therefore decided to split the web APIs out as separate microservices, since they had different infrastructure requirements from the core product. Of the handful of microservice frameworks available, Zend Expressive was chosen because it was lightweight and, at the time of selection, showed very good performance, better documentation, and more active development.
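The sketch below shows the shape of one such read-only endpoint as a PSR-15 handler for Zend Expressive; the handler class, repository interface, and route are hypothetical illustrations, not the product's actual API.

```php
<?php
namespace App\Handler;

use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\ServerRequestInterface;
use Psr\Http\Server\RequestHandlerInterface;
use Zend\Diactoros\Response\JsonResponse;

// Hypothetical data-access dependency, wired up via the DI container.
interface AnimalRepositoryInterface
{
    /** @return array List of adoptable animals for the organization. */
    public function findAdoptableByOrg(string $orgId): array;
}

class AdoptableAnimalsHandler implements RequestHandlerInterface
{
    /** @var AnimalRepositoryInterface */
    private $animals;

    public function __construct(AnimalRepositoryInterface $animals)
    {
        $this->animals = $animals;
    }

    public function handle(ServerRequestInterface $request): ResponseInterface
    {
        // The org id is captured from the route and used to scope the query.
        $orgId = $request->getAttribute('org_id');
        return new JsonResponse($this->animals->findAdoptableByOrg($orgId));
    }
}

// Route registration, e.g. in config/routes.php:
// $app->get('/api/organizations/{org_id}/adoptable', AdoptableAnimalsHandler::class);
```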
The Results
Today the product is loved by the vast majority of its users. By the end of the third year of our association, the client company had raised $2M in angel funding. They continued to grow and make more users happy every day. By the end of the sixth year they had raised Series B funding from a VC and, as the VC required, gradually hired a fully in-house team. With our full support, the client trained this team and transitioned all development and maintenance responsibilities to it. The entire knowledge base for the project, along with all digital assets, was transferred to the ownership of the client company with all necessary legal formalities.